This workshop is mainly adapted from the mixOmics tutorials (http://mixomics.org) and the Bioconductor vignette (https://bioconductor.org/packages/release/bioc/html/mixOmics.html).

1 Introduction

This session introduces dimensionality reduction methods for two matched (two-omic) datasets obtained from the same samples.

The main purpose of Canonical Correlation Analysis (CCA) is to explore correlations between two sets of variables observed on the same individuals, where the two sets play strictly symmetric roles in the analysis.

Regularised canonical correlation analysis (rCCA) is a statistical method that extends CCA by adding a regularisation term to the objective function. Regularisation helps to prevent overfitting, which occurs when a model is too complex and learns the noise in the data instead of the true underlying relationships.

rCCA can be used to find linear relationships between two sets of variables, even when the data are noisy or contain more variables than samples (a setting where classical CCA is ill-posed).
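As a sketch of the underlying idea (notation ours, not taken from mixOmics): classical CCA finds weight vectors $a$ and $b$ maximising the correlation between $Xa$ and $Yb$, which requires inverting the within-set covariance matrices; rCCA makes these invertible by adding ridge penalties $\lambda_1$ and $\lambda_2$:

\[
\rho = \max_{a,\,b} \; \frac{a^{\top}\Sigma_{XY}\,b}{\sqrt{a^{\top}\left(\Sigma_{XX}+\lambda_1 I\right)a}\;\sqrt{b^{\top}\left(\Sigma_{YY}+\lambda_2 I\right)b}}
\]

Choosing $\lambda_1$ and $\lambda_2$ is exactly what the tuning procedures below are for.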

This session will also cover projection to latent structures (PLS) regression, which builds components from two datasets to maximise covariance.

2 Load example dataset

library(mixOmics)
## Loading required package: MASS
## Loading required package: lattice
## Loading required package: ggplot2
## 
## Loaded mixOmics 6.24.0
## Thank you for using mixOmics!
## Tutorials: http://mixomics.org
## Bookdown vignette: https://mixomicsteam.github.io/Bookdown
## Questions, issues: Follow the prompts at http://mixomics.org/contact-us
## Cite us:  citation('mixOmics')
data(breast.TCGA)

# Extract training data and name each data frame
# There are 3 dataframes (3 omics of the same samples), stored as list
# In each dataframe, the rows represent samples.
# Columns represent feature measurements (genes, miRNA, protein)
X <- list(miRNA = breast.TCGA$data.train$mirna, 
          mRNA = breast.TCGA$data.train$mrna, 
          protein = breast.TCGA$data.train$protein)

Y <- breast.TCGA$data.train$subtype
table(Y)
## Y
## Basal  Her2  LumA 
##    45    30    75
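Before integrating, it is worth confirming that the blocks really are matched by sample. A quick sanity check in base R (using the objects loaded above):

# check dimensions of each block (samples x features)
lapply(X, dim)

# confirm all three blocks share the same samples, in the same order
stopifnot(identical(rownames(X$miRNA), rownames(X$mRNA)),
          identical(rownames(X$mRNA), rownames(X$protein)))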

3 Raw correlation

We can use imgCor() to view the raw correlation between sets of variables.

In this example x11() is used to capture the image in an XQuartz window, as a workaround to the limited viewing space in RStudio. This plotting step is optional.

# produce a heat map of the cross correlation matrix
x11()
imgCor(X$miRNA, X$mRNA,
        type = "combine", keysize = 0.05,
        sideColors = color.mixo(4:5), row.cex = 0.25, col.cex = 0.5) 
savePlot("correlationPlot.miRNA.mRNA.png", type = "png")

imgCor(X$mRNA, X$protein,
        type = "combine", keysize = 0.05,
        sideColors = color.mixo(5:6), row.cex = 0.25, col.cex = 0.5) 
savePlot("correlationPlot.mRNA.protein.png", type = "png")

imgCor(X$miRNA, X$protein,
        type = "combine", keysize = 0.05,
        sideColors = color.mixo(c(4,6)), row.cex = 0.25, col.cex = 0.5) 
savePlot("correlationPlot.miRNA.protein.png", type = "png")
rCCA raw correlation matrix (mRNA and protein)

4 rCCA tuning procedures

mixOmics implements rCCA with two regularisation procedures to overcome overfitting (i.e. finding false patterns/correlations among many comparisons): (a) Cross-Validated Ridge (CV-Ridge) and (b) Shrinkage. The CV-Ridge approach adds ridge penalty terms to the CCA problem and tunes them by cross-validation, whereas the Shrinkage approach shrinks the estimates of the canonical coefficients towards zero. Generally, CV-Ridge does not shrink the coefficients towards zero as much as the Shrinkage method: Shrinkage is more effective at preventing overfitting, but it also makes the model less powerful. CV-Ridge is computationally much slower than Shrinkage and may not be feasible for large numbers of variables (>5000), but it has the advantage of preserving the power of the model.

4.1 CV-Ridge

# set grid search values for each regularisation parameter
grid1 <- seq(0.001, 0.2, length = 15) 
grid2 <- seq(0.001, 0.2, length = 15)

# (slow step)
cv.tune.rcc.mRNA.protein <- tune.rcc(X$mRNA, X$protein, grid1 = grid1, grid2 = grid2,
                                   validation = "Mfold", folds = 6, plot = TRUE) 

saveRDS(cv.tune.rcc.mRNA.protein, file = "cv.tune.rcc.mRNA.protein.rds")
L1 and L2 optimisation by grid search

Note that this step can be time-consuming. The output object can be saved as an rds file, which you can load into a later session.

cv.tune.rcc.mRNA.protein <- readRDS("cv.tune.rcc.mRNA.protein.rds")

Examine output of cross-validation tuning.

cv.tune.rcc.mRNA.protein # examine the results of CV tuning
## 
## Call:
##  tune.rcc(X = X$mRNA, Y = X$protein, grid1 = grid1, grid2 = grid2, validation = "Mfold", folds = 6, plot = TRUE) 
## 
##   lambda1 =  0.1573571 , see object$opt.lambda1
##   lambda2 =  0.01521429 ,  see object$opt.lambda2
##  CV-score =  0.814176 , see object$opt.score

Get the optimised L1, L2

opt.L1 <- cv.tune.rcc.mRNA.protein$opt.lambda1 # extract the optimal lambda values
opt.L2 <- cv.tune.rcc.mRNA.protein$opt.lambda2


# fit the rCCA model using the optimised CV-Ridge penalties
cv.rcc.mRNA.protein <- rcc(X$mRNA, X$protein, method = "ridge",
                         lambda1 = opt.L1, lambda2 = opt.L2)
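The fitted canonical correlations themselves (one per dimension) are stored in the rcc object; the slot name below is taken from the mixOmics rcc output:

# canonical correlations of the tuned CV-Ridge model, one per dimension
cv.rcc.mRNA.protein$cor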

4.2 Shrinkage

shrink.rcc.mRNA.protein <- rcc(X$mRNA,X$protein, method = 'shrinkage') 

# examine the optimal lambda values after shrinkage 
shrink.rcc.mRNA.protein$lambda 
##   lambda1   lambda2 
## 0.1210357 0.1504092

5 rCCA: Results

From this point onwards, the steps are the same for CV-Ridge and Shrinkage rCCA models. For simplicity, we’ll carry on mainly with the results obtained using the Shrinkage procedure.

5.1 Component Plot

# barplot of shrinkage method rCCA canonical correlations
plot(shrink.rcc.mRNA.protein, type = "barplot", main = "Shrinkage") 

  • Barplots show the canonical correlation value for each dimension.

5.2 Sample Plot

The plotIndiv() function in this context can display the samples in “XY-space” (X and Y referring to the two datasets, X$mRNA and X$protein).

# plot the projection of samples for CV-Ridge rCCA data
plotIndiv(cv.rcc.mRNA.protein, comp = 1:2, 
          ind.names = row.names(X$mRNA),
          group = Y, rep.space = "XY-variate", 
          legend = TRUE, title = 'rCCA CV-Ridge XY-space', legend.title = "Cancer\nSubtype")

# plot the projection of samples for Shrinkage rCCA data
plotIndiv(shrink.rcc.mRNA.protein, comp = 1:2, 
          ind.names = row.names(X$mRNA),
          group = Y, rep.space = "XY-variate", 
          legend = TRUE, title = 'rCCA Shrinkage XY-space', legend.title = "Cancer\nSubtype")

5.3 Arrow Plot

  • Arrow plots are similar to the sample plots above, but each arrow connects a sample’s position in the subspace of one block to its position in the other.
  • Plots with very short arrows suggest the datasets are concordant.
# arrowplot
plotArrow(shrink.rcc.mRNA.protein, group = Y, 
          col.per.group = color.mixo(1:3),
          title = 'rCCA', legend.title = "Cancer\nSubtype")

5.4 Correlation Circle Plot

plotVar(shrink.rcc.mRNA.protein, var.names = c(FALSE, FALSE),
        cex = c(4, 4), cutoff = 0.5, col = c("black", "black"), overlap = FALSE,
        title = 'rCCA mRNA + protein')

5.5 Relevance network plot

network(shrink.rcc.mRNA.protein, comp = 1:2, interactive = FALSE,
        lwd.edge = 2,graph.scale = 0.25,
        cutoff = 0.5, color.node = color.mixo(5:6),
        save = "png", name.save = "network")

# focus on component 1
network(shrink.rcc.mRNA.protein, comp = 1, interactive = FALSE,
        lwd.edge = 2,graph.scale = 0.4,
        cutoff = 0.6, color.node = color.mixo(5:6),
        save = "png", name.save = "network.comp1")

# focus on component 2
network(shrink.rcc.mRNA.protein, comp = 2, interactive = FALSE,
        lwd.edge = 2,graph.scale = 0.4,
        cutoff = 0.25, color.node = color.mixo(5:6),
        save = "png", name.save = "network.comp2")
rCCA network map: component 1+2
rCCA network map: component 1
rCCA network map: component 2

When applying cim() to integration methods (two omics), the mapping parameter controls which association matrix is graphed. By default, the combined association matrix is used (mapping = “XY”), so each cell in the heat map represents the correlation between a feature from each ’omics dataset. If mapping = “X” or mapping = “Y” is used, each cell instead represents the raw expression data from the input X or Y dataset respectively.

cim(shrink.rcc.mRNA.protein, cutoff = 0.4, comp = 1:2, 
    xlab = "Protein", ylab = "mRNA", 
    col.cex = 0.5, row.cex=0.5, margins=c(6,5),
    save = "png", name.save = "rcc.mRNA.protein.cim")
rCCA cim

5.6 Practice and Discussion

  • Perform rCCA using miRNA + mRNA data, including an arrow plot and correlation heatmap.
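A minimal scaffold for this exercise is sketched below, using the faster Shrinkage procedure; the cutoff value is an arbitrary choice for readability, not prescribed by the tutorial:

# rCCA on miRNA + mRNA using the Shrinkage procedure
shrink.rcc.miRNA.mRNA <- rcc(X$miRNA, X$mRNA, method = "shrinkage")

# arrow plot of sample agreement between the two blocks
plotArrow(shrink.rcc.miRNA.mRNA, group = Y,
          col.per.group = color.mixo(1:3),
          title = "rCCA miRNA:mRNA", legend.title = "Cancer\nSubtype")

# correlation heatmap between the selected features of the two blocks
cim(shrink.rcc.miRNA.mRNA, comp = 1:2, cutoff = 0.4,
    xlab = "mRNA", ylab = "miRNA",
    save = "png", name.save = "rcc.miRNA.mRNA.cim")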

6 PLS regression

In the previous session we used PLSDA on a single-omic dataset, which found latent components (built from linear combinations of key features) that separate the experimental groups of samples (a categorical outcome).

‘Partial Least Squares’ or ‘Projection to Latent Structures’ (PLS) regression aims to find predictive features of a continuous outcome in another dataset, e.g. in this example data perhaps there are gene expression patterns that track very well with the expression of certain proteins, or miRNA that track well with the mRNA that they regulate.

Types

  • PLS type 1: univariate analysis (one feature against a matrix of features)
  • PLS type 2: multivariate analysis (a matrix of features against a matrix of features)

Modes

  • Regression: X and Y are asymmetric (X predicts Y)
  • Canonical: X and Y are symmetric (X and Y influence each other equally). Similar to CCA
  • Invariant: no matrix deflation, allowing a redundancy analysis
  • Classic: similar to regression, using different normalisations
# start with a basic sPLS model
spls.miRNA.RNA <- spls(X = X$miRNA, Y = X$mRNA, ncomp = 5, mode = 'regression')

6.1 Tuning

First, we need to optimise the number of components in the model.

# repeated CV tuning of component count
perf.spls.miRNA.RNA <- perf(spls.miRNA.RNA, validation = 'Mfold',
                         folds = 6, nrepeat = 5) 

plot(perf.spls.miRNA.RNA, criterion = 'Q2.total')

  • The horizontal line gives a recommended cutoff for dimension selection at Q2=0.0975
  • Choosing more components below this line is unlikely to improve the model.
  • 1 component might be optimal here, but we can increase to 2 components in order to view 2d output plots.
# set range of test values for number of variables to use from X dataframe
list.keepX <- 2^(3:7)
list.keepX
## [1]   8  16  32  64 128
# set range of test values for number of variables to use from Y dataframe
# we may be motivated to have keepY low, so set a lower test range
list.keepY <- c(3:10) 

tune.spls.miRNA.RNA <- tune.spls(X$miRNA, X$mRNA, ncomp = 2,
                              test.keepX = list.keepX,
                              test.keepY = list.keepY,
                              nrepeat = 1, folds = 6,
                              mode = 'regression', measure = 'cor') 
plot(tune.spls.miRNA.RNA)         # use the correlation measure for tuning

  • The averaged correlation coefficients between latent variables are plotted
  • Larger circles = higher mean correlation
  • Optimal values (green box) are maximising mean correlation
  • It is recommended to repeat this step with finer testing grids to improve the approximation
# set range of test values for number of variables to use from X dataframe
list.keepX <- round(2^(seq(2,6,0.5)))
list.keepX
## [1]  4  6  8 11 16 23 32 45 64
# set range of test values for number of variables to use from Y dataframe
# we may be motivated to have keepY low, so set a lower test range
list.keepY <- c(8:12) 

tune.spls.miRNA.RNA <- tune.spls(X$miRNA, X$mRNA, ncomp = 2,
                              test.keepX = list.keepX,
                              test.keepY = list.keepY,
                              nrepeat = 1, folds = 6,
                              mode = 'regression', measure = 'cor') 
plot(tune.spls.miRNA.RNA)         # use the correlation measure for tuning

We can take the optimal values from these test results and build a better model.

# extract optimal number of variables for X dataframe
optimal.keepX <- tune.spls.miRNA.RNA$choice.keepX 

# extract optimal number of variables for Y dataframe
optimal.keepY <- tune.spls.miRNA.RNA$choice.keepY

optimal.ncomp <-  length(optimal.keepX) # extract optimal number of components

optimal.keepX
## comp1 comp2 
##     6     6
optimal.keepY
## comp1 comp2 
##    11     9
optimal.ncomp
## [1] 2

Final model:

# use all tuned values from above
final.spls.miRNA.RNA <- spls(X$miRNA, X$mRNA, ncomp = optimal.ncomp, 
                    keepX = optimal.keepX,
                    keepY = optimal.keepY,
                    mode = "regression") # explanatory approach being used, 
                                         # hence use regression mode
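To see which features were actually retained on each component, mixOmics provides selectVar(); the call below assumes the final model fitted above:

# list the miRNA (X) and mRNA (Y) features selected on component 1
selected <- selectVar(final.spls.miRNA.RNA, comp = 1)
selected$X$name   # selected miRNAs (loading values are in selected$X$value)
selected$Y$name   # selected mRNAs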

6.2 Sample plot

sPLS is unsupervised, but we’ll colour-code the samples by group as previously.

plotIndiv(final.spls.miRNA.RNA,
          ind.names = FALSE, group = Y, legend=TRUE, 
          legend.title = "Cancer\nSubtype")

# plot with rep.space = "XY-variate" (averaged component subspace)
plotIndiv(final.spls.miRNA.RNA, rep.space = "XY-variate",
          ind.names = FALSE, group = Y, legend=TRUE, title = "sPLS miRNA:mRNA",
          legend.title = "Cancer\nSubtype")

  • Component 1 seems to be distinguishing the three groups, whereas Component 2 is not adding much help.

6.3 Arrow plot

The length of the arrows reflects the level of agreement between the datasets: short arrows indicate high agreement.

plotArrow(final.spls.miRNA.RNA, rep.space = "XY-variate",
          ind.names = FALSE, group = Y, legend=TRUE, title = "sPLS miRNA:mRNA",
          legend.title = "Cancer\nSubtype")

6.4 Variable plot

“Stability” of feature selection should be inspected. We can use the perf() function.

# form new perf() object which utilises the final model
perf.spls.miRNA.RNA <- perf(final.spls.miRNA.RNA, 
                          folds = 6, nrepeat = 30, # use repeated cross-validation
                          validation = "Mfold", cpus=3,
                          dist = "max.dist",  # use max.dist measure
                          progressBar = FALSE)
## Warning: The SGCCA algorithm did not converge
# plot the stability of each feature for the first two components;
# type = 'h' draws histogram-like vertical lines
par(mfrow=c(1,2)) 
plot(perf.spls.miRNA.RNA$features$stability.X$comp1, type = 'h',
     ylab = 'Stability',
     xlab = 'Features',
     main = 'Comp 1', las =2,
     ylim = c(0,1), xlim = c(0, 30))
plot(perf.spls.miRNA.RNA$features$stability.X$comp2, type = 'h',
     ylab = 'Stability',
     xlab = 'Features',
     main = 'Comp 2', las =2,
     ylim = c(0,1), xlim = c(0, 30))
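The same stability values plotted above can also be ranked numerically, to list the most consistently selected miRNAs (slot names as used in the plotting code above):

# top 10 miRNAs by selection stability on component 1
head(sort(perf.spls.miRNA.RNA$features$stability.X$comp1, decreasing = TRUE), 10)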

plotVar(final.spls.miRNA.RNA, cex = c(2,3), var.names = c(FALSE, TRUE), cutoff = 0.6)

  • This plot shows most of the power is in Component 1, since most of the selected genes (orange) spread along component 1 in concordance or opposition to a set of miRNAs (blue symbols).
  • A couple of genes are shown in Component 2 (influenced also by Component 1 as they are on the diagonal).

6.5 Network

color.edge <- color.GreenRed(50)  # set the colours of the connecting lines

network(final.spls.miRNA.RNA, comp = 1:2,
        cutoff = 0.6, # only show connections with a correlation above 0.6
        shape.node = c("rectangle", "circle"),
        color.node = color.mixo(4:5),
        size.node = 0.3,
        color.edge = color.edge,
        save = 'png', # save as a png to the current working directory
        name.save = 'sPLS.miRNA.mRNA.network')
sPLS network

6.6 Clustered Image Map

cim(final.spls.miRNA.RNA, comp = 1:2, xlab = "mRNA", ylab = "miRNA",
    margins = c(5,8), 
    cutoff = 0.6, save = "png", name.save = "sPLS.miRNA.mRNA.cim")

6.7 Practice

  • We can practice this method further using mRNA + protein data.
  • Let’s say that there is a particular protein of high interest: GATA3. Can we perform a focused PLS (type-1) regression to find a set of genes that track very strongly with/against the abundance level of this protein?
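One possible starting point for the GATA3 exercise is sketched below, assuming GATA3 is a column name in the protein block; the keepX values are arbitrary placeholders that should be tuned as in section 6.1:

# single-column outcome: abundance of the GATA3 protein
y.gata3 <- X$protein[, "GATA3"]

# sparse PLS1 regression: find genes tracking with/against GATA3 abundance
spls1.gata3 <- spls(X = X$mRNA, Y = y.gata3, ncomp = 2,
                    keepX = c(25, 25), mode = "regression")

# genes selected on component 1
selectVar(spls1.gata3, comp = 1)$X$name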

7 Discussion and Recap

  • rCCA - identify correlation structures between two sets of matching data (the same samples measured with two omics).
  • PLS regression - identify features that track well with features in another matching set of data.